---
id: openai
title: OpenAI
sidebar_label: OpenAI
---

import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";
import VideoDisplayer from "@site/src/components/VideoDisplayer";
import ColabButton from "@site/src/components/ColabButton";

# OpenAI

<ColabButton 
  notebookUrl="https://colab.research.google.com/github/confident-ai/deepeval/blob/main/examples/notebooks/openai.ipynb" 
  className="header-colab-button"
/>


`deepeval` streamlines evaluating and tracing your OpenAI applications through an **OpenAI client wrapper**, and supports end-to-end evaluations, component-level evaluations, and online evaluations in production.

## End-to-End Evals

To begin evaluating your OpenAI application, simply replace your OpenAI client with `deepeval`'s OpenAI client, and pass the `metrics` you wish to use through the `LlmSpanContext` in a `trace` context.

<Tabs groupId="openai">
<TabItem value="chat-completions" label="Chat Completions">

```python title="main.py" showLineNumbers
from deepeval.openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.tracing import trace, LlmSpanContext

client = OpenAI()

goldens = [
    Golden(input="What is the weather in Bogotá, Colombia?"),
    Golden(input="What is the weather in Paris, France?"),
]

dataset = EvaluationDataset(goldens=goldens)

for golden in dataset.evals_iterator():
    # run OpenAI client
    with trace(
        llm_span_context=LlmSpanContext(
            metrics=[AnswerRelevancyMetric(), BiasMetric()],
            expected_output=golden.expected_output,
        )
    ):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": golden.input}
            ],
        )
```

</TabItem>
<TabItem value="responses" label="Responses">

```python title="main.py" showLineNumbers
from deepeval.openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.tracing import trace, LlmSpanContext

client = OpenAI()

goldens = [
    Golden(input="What is the weather in Bogotá, Colombia?"),
    Golden(input="What is the weather in Paris, France?"),
]
dataset = EvaluationDataset(goldens=goldens)

for golden in dataset.evals_iterator():
    # run OpenAI client
    with trace(
        llm_span_context=LlmSpanContext(
            metrics=[AnswerRelevancyMetric(), BiasMetric()],
            expected_output=golden.expected_output,
        )
    ):
        client.responses.create(
            model="gpt-4o",
            instructions="You are a helpful assistant.",
            input=golden.input,
        )
```

</TabItem>
<TabItem value="async-chat-completions" label="Async Chat Completions">

```python title="main.py" showLineNumbers
import asyncio
from deepeval.openai import AsyncOpenAI
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.tracing import trace, LlmSpanContext

async_client = AsyncOpenAI()

async def openai_llm_call(golden):
    with trace(
        llm_span_context=LlmSpanContext(
            metrics=[AnswerRelevancyMetric(), BiasMetric()],
            expected_output=golden.expected_output,
        )
    ):
        return await async_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful chatbot. Always generate a string response."},
                {"role": "user", "content": golden.input},
            ],
        )

goldens = [
    Golden(input="What is the weather in Bogotá, Colombia?"),
    Golden(input="What is the weather in Paris, France?"),
]
dataset = EvaluationDataset(goldens=goldens)

for golden in dataset.evals_iterator():
    task = asyncio.create_task(openai_llm_call(golden))
    dataset.evaluate(task)
```

</TabItem>
<TabItem value="async-responses" label="Async Responses">

```python title="main.py" showLineNumbers
import asyncio
from deepeval.openai import AsyncOpenAI
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.tracing import trace, LlmSpanContext

async_client = AsyncOpenAI()

async def openai_llm_call(golden):
    with trace(
        llm_span_context=LlmSpanContext(
            metrics=[AnswerRelevancyMetric(), BiasMetric()],
            expected_output=golden.expected_output,
        )
    ):
        return await async_client.responses.create(
            model="gpt-4o",
            instructions="You are a helpful assistant.",
            input=golden.input,
        )

goldens = [
    Golden(input="What is the weather in Bogotá, Colombia?"),
    Golden(input="What is the weather in Paris, France?"),
]
dataset = EvaluationDataset(goldens=goldens)

for golden in dataset.evals_iterator():
    task = asyncio.create_task(openai_llm_call(golden))
    dataset.evaluate(task)
```

</TabItem>
</Tabs>

There are **FIVE** optional parameters you can set on the `LlmSpanContext` when invoking `deepeval`'s OpenAI client's chat completion and response methods:

- [Optional] `metrics`: a list of metrics of type `BaseMetric`.
- [Optional] `expected_output`: a string specifying the expected output of your OpenAI generation.
- [Optional] `retrieval_context`: a list of strings representing the retrieved contexts passed into your OpenAI generation.
- [Optional] `context`: a list of strings representing the ideal retrieval context for your OpenAI generation.
- [Optional] `expected_tools`: a list of strings representing the tools expected to be called during OpenAI generation.

:::info
`deepeval`'s OpenAI client automatically extracts the `input` and `actual_output` from each API response, enabling you to use metrics like **Answer Relevancy** out of the box. For metrics such as **Faithfulness**, which rely on additional parameters like retrieval context, you'll need to set these parameters explicitly when invoking the client.
:::
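
For instance, here is a minimal sketch that sets all five optional parameters on the `LlmSpanContext`, including the `retrieval_context` that a metric like **Faithfulness** relies on. The values are illustrative placeholders only:

```python title="main.py" showLineNumbers
from deepeval.openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.tracing import trace, LlmSpanContext

client = OpenAI()

with trace(
    llm_span_context=LlmSpanContext(
        metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
        # Illustrative placeholder values -- replace with data from your own pipeline
        expected_output="It is currently sunny in Bogotá.",
        retrieval_context=["Bogotá weather report: sunny, 24°C."],
        context=["Official weather data for Bogotá, Colombia."],
        expected_tools=["get_weather"],
    )
):
    client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the weather in Bogotá, Colombia?"},
        ],
    )
```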

## Component-Level Evals

You can also use `deepeval`'s OpenAI client **within component-level evaluations**. To set up component-level evaluations, add the `@observe` decorator to your LLM application's components, and simply replace existing OpenAI clients with `deepeval`'s OpenAI client, passing in the metrics you wish to use.

<Tabs groupId="openai">
<TabItem value="chat-completions" label="Chat Completions">

```python title="main.py" showLineNumbers
from deepeval.tracing import observe, trace, LlmSpanContext
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.openai import OpenAI

client = OpenAI()

@observe()
def retrieve_docs(query):
    return [
        "Paris is the capital and most populous city of France.",
        "It has been a major European center of finance, diplomacy, commerce, and science."
    ]

@observe()
def llm_app(input):
    with trace(
        llm_span_context=LlmSpanContext(
            metrics=[AnswerRelevancyMetric(), BiasMetric()],
        ),
    ):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": '\n'.join(retrieve_docs(input)) + "\n\nQuestion: " + input}
            ]
        )
    return response.choices[0].message.content

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="...")])

# Iterate through goldens
for golden in dataset.evals_iterator():
    # run your LLM application
    llm_app(input=golden.input)
```

</TabItem>
<TabItem value="responses" label="Responses">

```python title="main.py" showLineNumbers
from deepeval.tracing import observe, trace, LlmSpanContext
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.openai import OpenAI

@observe()
def retrieve_docs(query):
    return [
        "Paris is the capital and most populous city of France.",
        "It has been a major European center of finance, diplomacy, commerce, and science."
    ]

@observe()
def llm_app(input):
    client = OpenAI()
    with trace(
        llm_span_context=LlmSpanContext(
            metrics=[AnswerRelevancyMetric(), BiasMetric()],
        ),
    ):
        response = client.responses.create(
            model="gpt-4o",
            instructions="You are a helpful assistant.",
            input=input
        )
    return response.output_text

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="...")])

# Iterate through goldens
for golden in dataset.evals_iterator():
    # run your LLM application
    llm_app(input=golden.input)
```

</TabItem>
<TabItem value="async-chat-completions" label="Async Chat Completions">

```python title="main.py" showLineNumbers
import asyncio
from deepeval.tracing import observe, trace, LlmSpanContext
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.openai import AsyncOpenAI

@observe()
async def retrieve_docs(query):
    return [
        "Paris is the capital and most populous city of France.",
        "It has been a major European center of finance, diplomacy, commerce, and science."
    ]

@observe()
async def llm_app(input):
    client = AsyncOpenAI()
    with trace(
        llm_span_context=LlmSpanContext(
            metrics=[AnswerRelevancyMetric(), BiasMetric()],
        ),
    ):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": '\n'.join(await retrieve_docs(input)) + "\n\nQuestion: " + input}
            ],
        )
    return response.choices[0].message.content

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="...")])

# Iterate through goldens
for golden in dataset.evals_iterator():
    # add LLM App task to test_run
    task = asyncio.create_task(llm_app(input=golden.input))
    dataset.evaluate(task)
```

</TabItem>
<TabItem value="async-responses" label="Async Responses">

```python title="main.py" showLineNumbers
import asyncio
from deepeval.tracing import observe, trace, LlmSpanContext
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.openai import AsyncOpenAI

@observe()
async def retrieve_docs(query):
    return [
        "Paris is the capital and most populous city of France.",
        "It has been a major European center of finance, diplomacy, commerce, and science."
    ]

@observe()
async def llm_app(input):
    client = AsyncOpenAI()
    with trace(
        llm_span_context=LlmSpanContext(
            metrics=[AnswerRelevancyMetric(), BiasMetric()],
        ),
    ):
        response = await client.responses.create(
            model="gpt-4o",
            instructions="You are a helpful assistant.",
            input=input,
        )
    return response.output_text

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="...")])

# Iterate through goldens
for golden in dataset.evals_iterator():
    # add LLM App task to test_run
    task = asyncio.create_task(llm_app(input=golden.input))
    dataset.evaluate(task)
```

</TabItem>
</Tabs>

When used inside `@observe` components, `deepeval`'s OpenAI client automatically:

- Generates an LLM span for every OpenAI API call, including nested Tool spans for any tool invocations (see the sketch after this list).
- Attaches an `LLMTestCase` to each generated LLM span, capturing inputs, outputs, and tools called.
- Records span-level LLM attributes such as the input prompt, generated output, and token usage.
- Logs hyperparameters such as model name and system prompt for comprehensive experiment analysis.
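
As an illustration, the sketch below defines a hypothetical `get_weather` tool schema and makes a tool-enabled call inside an `@observe` component; per the behavior described above, any resulting tool invocation should be captured as a nested Tool span:

```python title="main.py" showLineNumbers
from deepeval.openai import OpenAI
from deepeval.tracing import observe

client = OpenAI()

# Hypothetical tool schema, for illustration only
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

@observe()
def llm_app(input):
    # The wrapper traces this call as an LLM span; if the model calls
    # get_weather, the invocation is recorded as a nested Tool span
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": input}],
        tools=[weather_tool],
    )
    return response.choices[0].message
```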

<div style={{ margin: "2rem 0" }}>
  <VideoDisplayer
    src="https://deepeval-docs.s3.us-east-1.amazonaws.com/integrations:frameworks:openai.mp4"
    label="OpenAI Integration"
    confidentUrl="/llm-tracing/integrations/openai"
  />
</div>

## Online Evals in Production

If your OpenAI application is in production and you still want to run evaluations on your traces, use online evals, which run evaluations on all incoming traces on Confident AI's servers.

Set the `metric_collection` name on the `LlmSpanContext` in the `trace` context when invoking your OpenAI client to evaluate LLM spans.

```python title="main.py" showLineNumbers
from deepeval.openai import OpenAI
from deepeval.tracing import trace, LlmSpanContext

client = OpenAI()

with trace(
    llm_span_context=LlmSpanContext(
        metric_collection="test_collection_1",
    ),
):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )
```
